Learning And-Or Model to Represent Context and 
Occlusion for Car Detection and Viewpoint Estimation 

Tianfu Wu*, Bo Li* and Song-Chun Zhu 


Abstract —This paper presents a method for learning an And-Or model to represent context and occlusion for car detection and 
viewpoint estimation. The learned And-Or model represents car-to-car context and occlusion configurations at three levels: (i) 
spatially-aligned cars, (ii) single car under different occlusion configurations, and (iii) a small number of parts. The And-Or model 
embeds a grammar for representing large structural and appearance variations in a reconfigurable hierarchy. The learning process 
consists of two stages in a weakly supervised way (i.e., only bounding boxes of single cars are annotated). Firstly, the structure of the 
And-Or model is learned with three components: (a) mining multi-car contextual patterns based on layouts of annotated single car 
bounding boxes, (b) mining occlusion configurations between single cars, and (c) learning different combinations of part visibility based 
on CAD simulations. The And-Or model is organized in a directed and acyclic graph which can be inferred by Dynamic Programming. 
Secondly, the model parameters (for appearance, deformation and bias) are jointly trained using Weak-Label Structural SVM. In 
experiments, we test our model on four car detection datasets (the KITTI dataset [1], the PASCAL VOC2007 car dataset, and two 
self-collected car datasets, namely the Street-Parking car dataset and the Parking-Lot car dataset) and three datasets for car viewpoint 
estimation (the PASCAL VOC2006 car dataset, the 3D car dataset, and the PASCAL3D+ car dataset). Compared with 
state-of-the-art variants of deformable part-based models and other methods, our model achieves significant improvement consistently 
on the four detection datasets, and comparable performance on car viewpoint estimation. 

Index Terms —Car Detection, Car Viewpoint Estimation, And-Or Graph, Hierarchical Model, Context, Occlusion Modeling. 



1 Introduction 

1.1 Motivation and Objective 

Cars are among the most frequently seen object categories in everyday scenes. Car detection and viewpoint estimation by a computer vision system have broad applications such as autonomous driving and parking management. Fig. 1 shows a few examples with varying complexities in car detection from four datasets. Car detection and viewpoint estimation are challenging problems due to large structural and appearance variations, especially the ubiquitous occlusions, which further increase the intra-class variations significantly. In this paper, we are interested in learning a unified model which can detect cars in the four datasets and estimate car viewpoints. We aim to address the following two main issues.

The first is to explicitly represent occlusion. Occlusion is a critical aspect in object detection for several reasons: (i) we do not know ahead of time what portion of an object (e.g., a car) will be visible in a test image; (ii) we also do not know the occluded areas in weakly-labeled training data (i.e., only bounding boxes of single cars are given, as considered in this paper); and (iii) object occlusions in testing data could be very different from those in training data. Handling occlusions entails models capable of capturing the underlying regularities of occlusions at part level (i.e., different occlusion configurations).

Fig. 1. Illustration of varying complexities in car detection from four datasets. (a) The PASCAL VOC2007 car dataset consists of single cars under different viewpoints but with less occlusion, as pointed out in [5]. (b) The KITTI car benchmark [1] includes on-road cars captured by a camera mounted on a driving car, which have more occlusions but restricted viewpoints. (c) The Street-Parking car dataset includes cars with heavy occlusions but less multi-car context. (d) The Parking-Lot car dataset consists of cars with heavy occlusions and rich multi-car context. The proposed And-Or model is learned for car detection in all four datasets.

The second is to explicitly exploit contextual information co-occurring with occlusions (see examples in Fig. 1 (b), (c) and (d)), which goes beyond single-car detection. We focus on car-to-car contextual patterns (e.g., different multi-car configurations such as 2, 3 or 4 cars), which will be utilized in detection and viewpoint estimation and naturally integrated with occlusion configurations.

To represent both occlusion and context, we propose to learn an And-Or model which takes into account structural and appearance variations at the multi-car, single-car and part levels jointly. Our And-Or model belongs to the family of grammar models embedded in a hierarchical graph structure, which can express a large number of configurations (occlusion configurations and multi-car configurations) in a compositional and reconfigurable manner. Fig. 3 illustrates our And-Or model. By reconfigurable, we mean that we learn appearance templates and deformation models for single cars and parts, and the composed appearance template for a multi-car contextual pattern is inferred on-the-fly in detection according to the selections of its child single-car Or-nodes. So, our model can express a large number of multi-car contextual patterns with different compatible occlusion configurations of single cars. Reconfigurability is one of the most desired properties in hierarchical models; it plays the main role in boosting the performance in our experiments, and also distinguishes the proposed method from other models such as the visual phrase model and different object-pair models.

Fig. 2. Illustration of the statistical regularities of car occlusions and multi-car contextual patterns by CAD simulation. We represent car-to-car occlusion at semantic part level (left) and generate a large number of synthetic occlusion configurations (middle) w.r.t. four factors (car type, orientation, relative position and camera view). We represent the regularities of different combinations of part visibilities (i.e., occlusion configurations) by a hierarchical And-Or model. This model also represents multi-car contextual patterns (right) based on the geometric configurations of single cars.

1.2 Method Overview 

1.2.1 Data Preparation with Simulation Study 

Manually annotating car views, parts and part occlusions on real images is time-consuming and usually error-prone. One innovation in this paper is that we generate a large set of occlusion configurations and multi-car configurations using CAD models¹ and a publicly available graphics rendering engine, the SketchUp SDK². In the CAD simulation, the occlusion configurations and multi-car contextual patterns reflect variations in four factors: car type, orientation, relative position and camera view. We decompose a car into 17 semantic parts, as shown in different colors on the left side of Fig. 2. We then generate a large number of examples by placing 3 cars in a 3 × 3 grid (resembling the regularities of cars in parking lots or on the road; see the middle of Fig. 2). For the cars in the center, we compare their part visibilities from different viewpoints (as illustrated by the camera icons), and obtain the part occlusion data matrix (each row represents an example and each entry takes a binary value, 0/1, representing whether a part is occluded or not under a viewpoint). The data matrix is used to learn the occlusion configurations. Similarly, we learn different multi-car contextual patterns based on the geometric configurations (see some examples on the right side of Fig. 2). Note that the semantic part annotations in the synthetic examples are used to learn the structure of our And-Or model, and the parts are treated as latent variables in the weakly-annotated training data of real images. We do not evaluate the performance of part localization; instead, we evaluate viewpoint estimation based on the inferred part configurations.

1. We used 40 CAD models selected from www.doschdesign.com and Google 3D Warehouse.

2. www.sketchup.com

In the simulation, we place 3 cars in a 3 × 3 grid with three considerations: (i) It can generate different occlusion configurations for the car in the center under different camera viewpoints, as well as different multi-car contextual patterns (2-car or 3-car patterns), which is easier than using 2 cars when processing the data in simulation. (ii) It can generate a synthetic dataset in which the occlusion configurations and multi-car contextual patterns are generic enough to cover the four situations in Fig. 1. (iii) It can also reduce the gap between the synthetic data and real data when learning the initial appearance parameters for parts, with the car in the back instead of a white background (see more details in Sec. 5).

1.2.2 The And-Or Model 

There are three types of nodes in the And-Or model: an And-node represents decomposition (e.g., a car is composed of a small number of parts), an Or-node represents alternative ways of decomposition accounting for structural variations (e.g., different part configurations of a single car due to occlusions), and a Terminal-node captures appearance variations to ground a car or a part to image data.
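As an illustration of how the three node types compose, the following minimal Python sketch builds a toy And-Or graph and counts the configurations (parse trees) it can express. The class `Node`, the function `count_configurations` and the toy part names are hypothetical and chosen only for this example; they are not taken from the released code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a toy And-Or graph (hypothetical structure for illustration)."""
    name: str
    kind: str  # "and", "or", or "terminal"
    children: List["Node"] = field(default_factory=list)


def count_configurations(node: Node) -> int:
    """Count the distinct parse trees rooted at `node`."""
    if node.kind == "terminal":
        return 1
    counts = [count_configurations(child) for child in node.children]
    if node.kind == "and":
        # An And-node keeps all children, so the choices multiply.
        total = 1
        for c in counts:
            total *= c
        return total
    # An Or-node selects exactly one child, so the choices add up.
    return sum(counts)


def build_toy_car() -> Node:
    """Single car: consistently visible part AND one of two optional clusters."""
    body = Node("consistently-visible", "terminal")
    front = Node("optional-front", "terminal")
    rear = Node("optional-rear", "terminal")
    optional = Node("optional-cluster", "or", [front, rear])
    config = Node("car-config", "and", [body, optional])
    return Node("car", "or", [config])


pair = Node("2-car", "and", [build_toy_car(), build_toy_car()])
root = Node("root", "or", [pair, build_toy_car()])
print(count_configurations(root))
```

The toy root Or-node chooses between a 2-car pattern and a single car; because each single car can itself be reconfigured, the number of expressible configurations grows multiplicatively under And-nodes.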

Fig. 3. Illustration of our And-Or model for car detection. It represents multi-car contextual patterns and occlusion configurations jointly by modeling spatially-aligned multi-cars together and composing visible parts explicitly for single cars. (Best viewed in color)

Fig. 3 illustrates the learned And-Or model. The hierarchy consists of a layer of multi-car contextual patterns (top) and several layers of occlusion configurations of single cars (bottom). The overall structure is as follows:

i) The root Or-node represents different multi-car configurations which capture both viewpoints and car-to-car contextual patterns. Each multi-car contextual pattern is then represented by an And-node (e.g., car pairs and car triples shown in the figure). The contextual information reflects the layout regularities of a small number, N (e.g., $N \in \{2, 3\}$), of cars in real situations (such as cars in a parking lot).

ii) A multi-car And-node is decomposed into nodes representing single cars. Each single car is represented by an Or-node (e.g., the 1st car and the 2nd car), since we have different combinations of car types, viewpoints and occlusion configurations. Here, a multi-car And-node embeds the reconfigurable compositional grammar of a multi-car configuration (e.g., the three 2-car configurations in the right-top of Fig. 3), in which the single cars are reconfigurable w.r.t. viewpoint and occlusion configuration (up to some extent), and car type. This reconfigurability gives our model expressive power to handle the large variations of multi-car configurations in real situations.

iii) Each occlusion configuration is represented by an And-node which is further decomposed into parts. Parts are learned using CAD simulation (i.e., the 17 semantic parts) and are organized into consistently visible parts and optional part clusters (see the example in the right-bottom of Fig. 3). Then, a single car can be represented by the consistently visible parts (i.e., And) and one of the optional part clusters (i.e., Or). The green dashed bounding boxes show some examples corresponding to different occlusion configurations (i.e., visible parts) from the same viewpoint.


1.2.3 Weakly-supervised Learning of the And-Or Model 

Using weakly-annotated real image training data and the 
synthetic data, we learn the And-Or model in two stages: 

i) Learning the structure of the hierarchical And-Or model. Both the multi-car contextual patterns and occlusion configurations of single cars are learned automatically based on the annotated single car bounding boxes in training data together with the synthetic examples generated from CAD simulations. The multi-car contextual patterns are mined or clustered from the geometric layout features. The occlusion configurations are learned by a clustering method using the part visibility data matrix. The learned structure is a directed and acyclic graph since we have both single-car-sharing and part-sharing; thus Dynamic Programming (DP) can be applied in inference.

ii) Learning the parameters for appearance, deformation and bias. Given the learned structure of the And-Or model, we jointly train the parameters in the structural SVM framework and adopt the Weak-Label Structural SVM (WLSSVM) method [15], [16] in implementation.

1.2.4 Experiments 

In experiments, we evaluate the detection performance of our model on four car datasets: the KITTI dataset [1], the PASCAL VOC2007 car dataset, and two self-collected datasets, the Street-Parking dataset and the Parking-Lot dataset (which are released with this paper). Our model outperforms different state-of-the-art variants of DPM [17] (including the latest implementation [18]) on all four datasets, as well as other state-of-the-art models (e.g., [20]) on the KITTI and the Street-Parking datasets. We evaluate viewpoint estimation performance on three car datasets: the PASCAL VOC2006 car dataset, the 3D car dataset, and the PASCAL3D+ car dataset. Our model achieves performance comparable to the state-of-the-art methods (and significantly better than the method using deep learning features). The detection code and data are available on the author's homepage³.

Paper Organization. The remainder of this paper is organized as follows. Section 2 overviews the related work and summarizes our contributions. Section 3 presents the And-Or model and defines its scoring functions. Section 4 presents the method of mining multi-car contextual patterns and occlusion configurations of single cars in weakly-labeled training data. Section 5 discusses the learning of model parameters using WLSSVM, as well as details of the DP inference algorithm. Section 6 presents the experimental results and comparisons of the proposed model on the four car detection datasets and the three viewpoint estimation datasets. Section 7 concludes the paper with discussions.

2 Related Work and Our Contributions 

Over the last decade, object detection has made much progress in various vision tasks such as face detection, pedestrian detection, and generic object detection. In this section we focus on occlusion and context modeling in object detection, and classify the recent literature into three research streams. For a full review of contemporary approaches, we refer the reader to recent survey articles.

i) Single Object Modeling and Occlusion Modeling. Hierarchical models are widely used in the recent literature of object detection, and most existing approaches are devoted to learning a single object model. Many works extended the deformable part-based model (which has a two-layer structure) by exploring deeper hierarchies and global part configurations, using strong manually-annotated parts or CAD models, or keeping a human in the loop [31]. To address the occlusion problem, various occlusion models estimate the visibilities of parts from image appearance, using assumptions that the visibility of a part is (a) independent from other parts [32], [33], [34], [35], [36], (b) consistent with neighboring parts [15], [37], or (c) consistent with its parent or child parts describing object appearance at different scales. Another essential problem is to organize part configurations. Recently, several works, including [15] and [34], explored different ways to deal with this problem. In particular, [34] modeled different part configurations by local part mixtures, [15] used a more flexible grammar model to infer both the occluder and visible parts of an occluded person, and a third approach regularized parts into consistently visible parts and optional part clusters, which is more efficient for representing occlusion configurations. Recent work [40], [41], [42], [43] proposed to enumerate possible occlusion configurations and model each occlusion configuration as a specific component. [44] proposed a 2D model to learn discriminative subcategories, which was later integrated with an explicit 3D occlusion model, both showing excellent performance on the KITTI dataset. Though those models were successful in some heavily occluded cases, they did not represent contextual information, and usually learned another separate context model using the detection scores as input features. Recently, an And-Or quantization method was proposed to learn And-Or tree models for generic object detection in PASCAL VOC and to learn 3D And-Or models [47], which could be useful in occlusion modeling.

3. http://www.stat.ucla.edu/~tfwu/projects.htm

ii) Object-Pair and Visual Phrase Models. To account for strong co-occurrence, object-pair and visual phrase [10] methods modeled occlusions and interactions using an X-to-X or X-to-Y composite template that spans both one object (i.e., "X", such as a person or a car) and another interacting object (i.e., "X" or "Y", such as the other car in a car pair in parking lots or a bicycle on which a person is riding). Although these models can handle occlusion better than single object models, the object-pair and visual phrase models represented occlusion implicitly, and they were often manually designed with fixed structures (i.e., not reconfigurable in inference). They performed worse than the original DPM on the KITTI dataset, as evaluated by [14].

iii) Context Models. Many context models have been exploited in object detection with improved performance [48], [49]. Hoiem et al. explored a scene context, and Desai et al. improved object detectors by incorporating the multi-class context on the PASCAL dataset in a max-margin framework. Tu and Bai integrated the detector responses with background pixels to determine the foreground pixels. In [52], Chen et al. proposed a multi-order context representation to take advantage of the co-occurrence of different objects. Recently, geographic contextual information was explored to facilitate car detection, and [54] explored a 3D panoramic context in object detection. Although these works verified that context is crucial in object detection, most of them modeled objects and context separately, not in a unified framework.

This paper extends our two previous conference papers in the following aspects: (i) A unified representation is learned for integrating occlusion and context; (ii) More details on the learning algorithm and the detection algorithm are presented; (iii) More analyses and comparisons of the experimental results are added, with improved performance.

This paper makes three contributions to the literature of 
car detection. 

i) It proposes an And-Or model to represent multi-car context and occlusion configurations. The proposed model is multi-scale and reconfigurable to account for large structure, viewpoint and occlusion variations.

ii) It presents a simple, yet effective, approach to mine context and occlusion configurations from weakly-labeled training data.

iii) It introduces two datasets for evaluating occlusion and multi-car context, and obtains performance comparable to or better than state-of-the-art car detection methods on four challenging datasets.

3 Representation and Inference 
3.1 The And-Or Model and Scoring Functions 

In this section, we introduce the notations in defining the 
And-Or model and its scoring functions. 

An And-Or model is defined by a 3-tuple, $\mathcal{G} = (V, E, \Theta)$, where $V = V_{And} \cup V_{Or} \cup V_T$ represents the nodes in three subsets: And-nodes $V_{And}$, Or-nodes $V_{Or}$ and Terminal-nodes $V_T$; $E$ is the set of edges organizing all the nodes in a directed and acyclic graph (DAG); and $\Theta = (\Theta^{app}, \Theta^{def}, \Theta^{bias})$ is the set of parameters (for appearance, deformation and bias respectively, to be defined later).

A Parse Tree is an instantiation of the And-Or model obtained by selecting the best child (according to the scoring functions to be defined) for each encountered Or-node. The green arrows in Fig. 3 show an example of a parse tree.

Appearance Features. We adopt the Histogram of Oriented Gradients (HOG) feature [17], [55] to describe appearance. Let $I$ be an image defined on an image lattice. Denote by $\mathcal{H}$ the HOG feature pyramid computed for $I$ using $\lambda$ levels per octave, and by $\Lambda$ the lattice of the whole pyramid. Let $p = (l, x, y) \in \Lambda$ specify a position $(x, y)$ in the $l$-th level of the pyramid $\mathcal{H}$. Denote by $\Phi^{app}(\mathcal{H}, p_t)$ the extracted HOG features for a Terminal-node $t$ placed at position $p_t$ in the pyramid.

Deformation Features. We allow local deformation when composing the child nodes into a parent node. In our model, parts are placed at twice the spatial resolution w.r.t. single cars, while single cars and composite multi-cars are at the same spatial resolution. We penalize the displacements between the anchor locations of child nodes (w.r.t. the placed parent node) and their actual deformed locations. Denote by $\delta = [dx, dy]$ the displacement. The deformation feature is defined by

$$\Phi^{def}(\delta) = [dx^2, dx, dy^2, dy]'.$$

A Terminal-node $t \in V_T$ grounds a single car or a part to image data (see Layers 3 and 4 in Fig. 3). Given a parent node $A$, the model for $t$ is defined by a 4-tuple $(\theta^{app}_t, s_t, a_{t|A}, \theta^{def}_{t|A})$, where $\theta^{app}_t \subset \Theta^{app}$ is the appearance template, $s_t \in \{0, 1\}$ the scale factor for placing node $t$ w.r.t. its parent node, $a_{t|A}$ a two-dimensional vector specifying an anchor position relative to the position of parent node $A$, and $\theta^{def}_{t|A} \subset \Theta^{def}$ the deformation parameters. Given the position $p_A = (l_A, x_A, y_A)$ of the parent node $A$, the scoring function of a Terminal-node $t$ is defined by

$$score(t|A, p_A) = \max_{\delta \in \Delta} \left( \langle \theta^{app}_t, \Phi^{app}(\mathcal{H}, p_t) \rangle - \langle \theta^{def}_{t|A}, \Phi^{def}(\delta) \rangle \right), \quad (1)$$

where $\Delta$ is the space of deformation (i.e., the lattice of the corresponding level in the feature pyramid), $p_t = (l_t, x_t, y_t)$ with $l_t = l_A - s_t \lambda$ and $(x_t, y_t) = 2^{s_t}(x_A, y_A) + a_{t|A} + \delta$, where $s_t = 0$ means the object and parts are placed at the same resolution and $s_t = 1$ means parts are placed at twice the resolution of the object templates, and $\langle \cdot, \cdot \rangle$ denotes the inner product. Fig. 3 shows some learned appearance templates.
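To make Eqn. (1) concrete, the following sketch evaluates the Terminal-node score for one parent position by brute force. It assumes the appearance responses $\langle \theta^{app}_t, \Phi^{app}(\mathcal{H}, p_t) \rangle$ have already been computed as a 2D array for the pyramid level of the Terminal-node. The function names and the search window `radius` are assumptions introduced for illustration (Eqn. (1) maximizes over the whole level), and an efficient implementation would use the generalized distance transform instead.

```python
import numpy as np


def deformation_feature(dx, dy):
    """Phi_def(delta) = [dx^2, dx, dy^2, dy]."""
    return np.array([dx * dx, dx, dy * dy, dy], dtype=float)


def terminal_score(response, parent_xy, anchor, theta_def, scale=1, radius=2):
    """Brute-force evaluation of Eqn. (1) for a single parent position.

    response[y, x] holds the appearance score of the Terminal-node at (x, y).
    `radius` limits the searched displacements (an assumption of this sketch).
    Returns the best score and the displacement (dx, dy) that attains it.
    """
    # Anchor location in the Terminal-node's level: 2^s * (x_A, y_A) + a.
    x0 = (2 ** scale) * parent_xy[0] + anchor[0]
    y0 = (2 ** scale) * parent_xy[1] + anchor[1]
    height, width = response.shape
    best, best_delta = -np.inf, None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = x0 + dx, y0 + dy
            if 0 <= x < width and 0 <= y < height:
                s = response[y, x] - theta_def @ deformation_feature(dx, dy)
                if s > best:
                    best, best_delta = s, (dx, dy)
    return best, best_delta


# Toy example: a single strong response displaced from the anchor.
response = np.zeros((10, 10))
response[5, 6] = 3.0
theta_def = np.array([0.1, 0.0, 0.1, 0.0])
score, delta = terminal_score(response, parent_xy=(2, 2), anchor=(0, 0),
                              theta_def=theta_def)
```

In the toy example the anchor falls at (4, 4) and the peak at (6, 5), so the best displacement trades the appearance gain against the quadratic deformation penalty.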

An And-node $A \in V_{And}$ represents a decomposition of a large entity (e.g., a multi-car layout at Layer 1 or a single car at Layer 3 in Fig. 3) into its constituents (e.g., 2 or 3 single cars or a small number of parts). Single car And-nodes are associated with viewpoints. Unlike the Terminal-nodes, single car And-nodes are not allowed to be deformable in a multi-car configuration in this paper (we implemented deformable ones in experiments and did not observe performance improvement, so for simplicity we make them not deformable). Denote by $ch(v)$ the set of child nodes of a node $v \in V_{And} \cup V_{Or}$. The position $p_A$ of an And-node $A$ is inherited from its parent Or-node, and then the scoring function is defined by

$$score(A, p_A) = \sum_{v \in ch(A)} score(v|A, p_A) + b_A, \quad (2)$$

where $b_A \in \Theta^{bias}$ is the bias term. Each single car And-node (at Layer 3) can be treated as the DPM [17] or as a previously proposed And-Or structure. So, our model is flexible enough to integrate state-of-the-art single object models. For multi-car And-nodes (at Layer 1), their child nodes are Or-nodes and the scoring function $score(v|A, p_A)$ is defined below.

An Or-node $O \in V_{Or}$ represents different structure variations (e.g., the root node and the $i$-th car node at Layer 2 in Fig. 3). For the root Or-node $O$, when placed at the position $p \in \Lambda$, the scoring function is defined by

$$score(O, p) = \max_{v \in ch(O)} score(v, p), \quad (3)$$

where $ch(O) \subset V_{And}$. For the $i$-th car Or-node $O$, given a parent multi-car And-node $A$ placed at $p_A$, the scoring function is then defined by

$$score(O|A, p_A) = \max_{v \in ch(O)} \max_{\delta \in \Delta} \left( score(v, p_v) - \langle \theta^{def}_{O|A}, \Phi^{def}(\delta) \rangle \right), \quad (4)$$

where $p_v = (l_v, x_v, y_v)$ with $l_v = l_A$ and $(x_v, y_v) = (x_A, y_A) + \delta$. The best child of an Or-node is computed by taking the arg max of Eqn. (3) and Eqn. (4).

3.2 The DP Algorithm in Detection 

In detection, we place the And-Or model at all positions $p \in \Lambda$ and retrieve the optimal parse trees for all positions at which the scores are greater than the detection threshold. Thanks to the directed and acyclic structure of our And-Or model, we can utilize the efficient DP algorithm, which consists of two stages:

In the bottom-up pass: Following the depth-first-search 
(DFS) order of nodes in the And-Or model, the bottom-up 
pass computes the matching scores of all possible parse trees 
of the And-Or model at all possible positions in the whole 
feature pyramid. 

First of all, we compute the appearance score maps (pyramid) for all Terminal-nodes (which is done by filter convolution). The optimal position of a Terminal-node w.r.t. a parent node can be computed as a function of the position of the parent node. The quality (matching score) of the optimal position for a Terminal-node w.r.t. a given position of the parent is computed using Eqn. (1) (which yields the deformed score map through the generalized distance transform trick, as done in the DPM [17] for efficiency), and the optimal position can be retrieved by replacing max in Eqn. (1) with arg max.

Then, following the DFS order of nodes, we compute the score maps for all the And-nodes and Or-nodes using Eqn. (2), (3) and (4), with the score maps of their child nodes having been computed already. Similarly, we can obtain the optimal branch for each Or-node by replacing the max in Eqn. (3) and (4) with arg max.

In the top-down pass, we first find all detection candidates for the root Or-node $O$ based on its score maps, i.e., the positions $P = \{p;\ score(O, p) > \tau \text{ and } p \in \Lambda\}$. Then, following the breadth-first-search (BFS) order of nodes, we retrieve the optimal parse tree at each $p \in P$: starting from the root Or-node, we select the optimal branch of each encountered Or-node, keep all the child nodes of each encountered And-node, and retrieve the optimal position of each Terminal-node. Based on the parsed sub-tree rooted at single car And-nodes, we obtain the viewpoint estimation and the occlusion configuration.
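The two passes can be sketched on a toy graph as follows. This is a deliberately simplified illustration, not the released implementation: it assumes that every node lives on the same 2D lattice and it ignores deformation, anchors and pyramid levels, so an And-node simply sums the score maps of its children and an Or-node takes their element-wise maximum. The class `Node` and the functions `bottom_up` and `top_down` are hypothetical names.

```python
import numpy as np


class Node:
    def __init__(self, name, kind, children=(), bias=0.0, response=None):
        self.name, self.kind = name, kind       # kind: "and", "or", "terminal"
        self.children = list(children)
        self.bias = bias                        # used by And-nodes only
        self.response = response                # Terminal-nodes: appearance map
        self.score = None                       # score map, filled bottom-up
        self.best = None                        # Or-nodes: best child per position


def bottom_up(node):
    """Depth-first pass; nodes shared in the DAG are scored only once."""
    if node.score is not None:
        return node.score
    if node.kind == "terminal":
        node.score = node.response
    else:
        maps = np.stack([bottom_up(child) for child in node.children])
        if node.kind == "and":
            node.score = maps.sum(axis=0) + node.bias
        else:
            node.best = maps.argmax(axis=0)
            node.score = maps.max(axis=0)
    return node.score


def top_down(root, tau):
    """Retrieve one parse tree (as a list of node names) per detection."""
    parses = {}
    for p in zip(*np.nonzero(root.score > tau)):
        queue, names = [root], []
        while queue:                            # breadth-first traversal
            n = queue.pop(0)
            names.append(n.name)
            if n.kind == "or":
                queue.append(n.children[n.best[p]])
            elif n.kind == "and":
                queue.extend(n.children)
        parses[tuple(int(i) for i in p)] = names
    return parses


t1 = Node("t1", "terminal", response=np.array([[1.0, 0.0], [0.0, 0.0]]))
t2 = Node("t2", "terminal", response=np.array([[0.0, 2.0], [0.0, 0.0]]))
shared = Node("shared", "terminal", response=np.array([[0.5, 0.5], [0.0, 0.0]]))
a1 = Node("a1", "and", [t1, shared])
a2 = Node("a2", "and", [t2, shared])
root = Node("root", "or", [a1, a2])
bottom_up(root)
parses = top_down(root, tau=1.0)
```

Here the Terminal-node `shared` is a child of both And-nodes, which mirrors the part-sharing that makes the structure a DAG rather than a tree; memoizing its score map is what keeps the bottom-up pass linear in the number of nodes.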

Post-processing. To generate the final detection results of single cars for evaluation, we apply multi-car guided non-maximum suppression (NMS) to deal with occlusions:

i) Some of the single cars in a multi-car detection candidate are highly overlapped due to occlusion, so if we directly use conventional NMS, we will miss the detection of the occluded cars. We enforce that the single car bounding boxes in a multi-car prediction are not suppressed by each other. A similar idea has also been used in prior work.

ii) Overlapped multi-car detection candidates might report multiple predictions for the same single car. For example, if a car is shared by a 2-car detection candidate and a 3-car detection candidate, it will be reported twice. We keep only the one with the higher score.
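The two rules above can be sketched as a greedy procedure. The sketch below is one possible reading and makes several assumptions that are not stated in the text: boxes are given as (x1, y1, x2, y2), each single car inherits the score of its multi-car candidate, overlap is measured by intersection-over-union, and the threshold of 0.5 is arbitrary. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def multicar_nms(detections, thresh=0.5):
    """Greedy NMS in which boxes of the same multi-car candidate never
    suppress each other.

    detections: list of (score, [box, ...]), one entry per multi-car candidate.
    Returns the kept (score, candidate_id, box) triples, highest score first.
    """
    singles = [(score, gid, box)
               for gid, (score, boxes) in enumerate(detections)
               for box in boxes]
    singles.sort(key=lambda s: s[0], reverse=True)
    kept = []
    for score, gid, box in singles:
        # Rule i): only boxes from a different candidate may suppress.
        # Rule ii): a duplicate from a lower-scoring candidate is dropped.
        suppressed = any(g != gid and iou(box, b) > thresh
                         for _, g, b in kept)
        if not suppressed:
            kept.append((score, gid, box))
    return kept


detections = [
    (0.9, [(0, 0, 10, 10), (3, 0, 13, 10)]),    # heavily overlapping pair
    (0.6, [(3, 0, 13, 10), (20, 0, 30, 10)]),   # shares a car with the pair
]
kept = multicar_nms(detections)
```

In the toy input the two cars of the first candidate overlap by more than the threshold yet both survive, while the duplicate car reported by the second candidate is removed.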

4 Learning And-Or Structures 

In this section, we present the methods of learning the structure of the And-Or model by mining contextual patterns and occlusion configurations in the positive training dataset.

4.1 Generating Multi-car Training Samples 

Positive Samples. Denote by $D^+ = \{(I_1, \mathcal{B}_1), \cdots, (I_n, \mathcal{B}_n)\}$ the positive training dataset, with $\mathcal{B}_i = \{B^i_j = (x^i_j, y^i_j, w^i_j, h^i_j)\}_{j=1}^{k_i}$ being the set of $k_i$ annotated single car bounding boxes in image $I_i$. Here, $(x, y)$ is the left-top corner and $(w, h)$ the width and height.

Denote the set of $N$-car positive samples by

$$D^+_{N\text{-car}} = \{(I_i, B^i_J);\ |J| = N,\ B^i_J \subseteq \mathcal{B}_i,\ i \in [1, n]\}, \quad (5)$$

where all the $I_i$'s have at least $N$ annotated single cars (i.e., $k_i \geq N$). We have,

i) $D^+_{1\text{-car}}$ consists of all the single car bounding boxes which do not overlap the other ones in the same image. For $N \geq 2$, $D^+_{N\text{-car}}$ is generated iteratively.

ii) In generating $D^+_{2\text{-car}}$ (see Fig. 4(a)), for each positive image $(I_i, \mathcal{B}_i) \in D^+$ with $k_i \geq 2$, we enumerate all valid 2-car configurations starting from $B^i_1 \in \mathcal{B}_i$: we first select the current $B^i_j$ as the first car ($1 \leq j \leq k_i$), obtain all the surrounding car bounding boxes $\mathcal{N}_{B^i_j}$ which overlap $B^i_j$, and then select the second car $B^i_k \in \mathcal{N}_{B^i_j}$ which has the largest overlap if $\mathcal{N}_{B^i_j} \neq \emptyset$ and $(I_i, B^i_J) \notin D^+_{2\text{-car}}$ ($J = \{j, k\}$).

iii) In generating $D^+_{N\text{-car}}$ ($N > 2$, see Fig. 4(b)), for each positive image with $k_i \geq N$ and $\exists (I_i, B^i_K) \in D^+_{(N-1)\text{-car}}$, we first select the current $B^i_K$ as the seed, obtain the neighbors $\mathcal{N}_{B^i_K}$, each of which overlaps at least one bounding box in $B^i_K$, and then select the bounding box $B^i_j \in \mathcal{N}_{B^i_K}$ which has the largest overlap and add $(I_i, B^i_J)$ to $D^+_{N\text{-car}}$ ($J = K \cup \{j\}$).

Fig. 4. Illustration of generating multi-car positive samples.
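Step ii) can be sketched for a single image as follows. The sketch assumes that "largest overlap" means largest overlap area and that a pair is skipped when it has already been collected; both are readings of the text rather than confirmed details, and the function names are hypothetical.

```python
def overlap_area(a, b):
    """Overlap area of two boxes given as (x, y, w, h), (x, y) = left-top."""
    ix = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    iy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(0, ix) * max(0, iy)


def two_car_samples(boxes):
    """Return the index pairs {j, k} of the 2-car samples of one image."""
    samples = []
    for j, bj in enumerate(boxes):
        # Neighbors of the current first car: all boxes overlapping it.
        neighbors = [(overlap_area(bj, bk), k)
                     for k, bk in enumerate(boxes)
                     if k != j and overlap_area(bj, bk) > 0]
        if not neighbors:
            continue
        _, k = max(neighbors)                 # neighbor with largest overlap
        pair = frozenset((j, k))
        if pair not in samples:               # skip already collected pairs
            samples.append(pair)
    return samples


boxes = [(0, 0, 10, 10), (5, 0, 10, 10), (12, 0, 10, 10), (100, 100, 5, 5)]
pairs = two_car_samples(boxes)
```

Generating $N$-car samples for $N > 2$ would follow the same pattern, using each $(N-1)$-car sample as the seed and adding the neighbor with the largest overlap.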

Negative Samples. We collect negative samples from images without cars provided in the benchmark datasets and apply the hard negative mining approach when learning the parameters, as done in the DPM [17].

4.2 Mining Multi-car Contextual Patterns 

This section presents the method of learning multi-car patterns in Layers 0 to 2 in Fig. 3. Considering $N \geq 2$, we use the relative positions of single cars to describe the layout of a multi-car sample $(I_i, B^i_J) \in D^+_{N\text{-car}}$. Denote by $(cx_j, cy_j)$ the center of the $j$-th car bounding box in $B^i_J$ ($J = \{1, \cdots, N\}$). Let $w_J$ and $h_J$ be the width and height of the union bounding box of $B^i_J$, respectively. With the center of the first car being the centroid, we define the layout feature by

$$\left[ \frac{cx_2 - cx_1}{w_J}, \frac{cy_2 - cy_1}{h_J}, \cdots, \frac{cx_N - cx_1}{w_J}, \frac{cy_N - cy_1}{h_J} \right].$$

We cluster these layout features over $D^+_{N\text{-car}}$ to get $T$ clusters using $k$-means. The obtained clusters are used to specify the And-nodes at Layer 1 in Fig. 3. The number of clusters $T$ is specified empirically for different training datasets in our experiments.

In Fig. 5 (top), we visualize the clustering results for $D^+_{2\text{-car}}$ on the KITTI and the Parking-Lot datasets. Each set of colored points represents a 2-car context pattern. In the KITTI dataset, we can observe that there are some car-to-car "peak" modes (similar to the analyses in [14]), while the context patterns are more diverse in the Parking-Lot dataset.
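A minimal sketch of the layout feature and its clustering is given below. It assumes boxes in (x, y, w, h) format and the sign convention "other car minus first car"; the plain Lloyd's $k$-means with the first $T$ samples as initial centers is a simplification chosen to keep the example deterministic, not the setting used in the experiments. The function names are hypothetical.

```python
import numpy as np


def layout_feature(boxes):
    """Layout feature of an N-car sample.

    boxes: (N, 4) array-like of (x, y, w, h), with the first row being the
    first car. Returns a 2(N-1)-dimensional vector of center offsets to the
    first car, normalized by the width and height of the union bounding box.
    """
    boxes = np.asarray(boxes, dtype=float)
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0
    x1, y1 = boxes[:, 0].min(), boxes[:, 1].min()
    x2 = (boxes[:, 0] + boxes[:, 2]).max()
    y2 = (boxes[:, 1] + boxes[:, 3]).max()
    size = np.array([x2 - x1, y2 - y1])          # (w_J, h_J)
    return ((centers[1:] - centers[0]) / size).ravel()


def kmeans(features, T, iters=20):
    """Plain Lloyd's k-means seeded with the first T samples (a simplification)."""
    features = np.asarray(features, dtype=float)
    centers = features[:T].copy()
    for _ in range(iters):
        dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for t in range(T):
            if np.any(labels == t):
                centers[t] = features[labels == t].mean(axis=0)
    return labels, centers


# Two side-by-side cars: the second car lies to the right of the first.
f = layout_feature([(0, 0, 10, 10), (10, 0, 10, 10)])
labels, centers = kmeans([[0.5, 0.0], [-0.5, 0.0], [0.52, 0.01], [-0.48, -0.02]], T=2)
```

Each resulting cluster would correspond to one multi-car And-node at Layer 1, with $T$ chosen per dataset as described above.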

4.3 Mining Occlusion Configurations 

In this section we present the method of learning occlusion 
configurations for single cars in Layers 3 and 4 in Fig. 3. 
We learn the occlusion configurations automatically from 
a large number of occlusion configurations generated by 
CAD simulations. Note that the synthetic data are used 
to learn the occlusion configurations, while the appearance 
and geometry parameters are still learned from real data. 
Fig. 5. Left-Top: 2-car context patterns on the KITTI dataset and the self-collected Parking-Lot dataset. Each context pattern is represented by a specific color set, and each circle stands for the center of each cluster. Left-Bottom: Overlap ratio histograms of the KITTI dataset and the Parking-Lot dataset (we show the occluded cases only). Right: some cropped examples with different occlusions. The 2 bounding boxes in a car pair are shown in red and blue, respectively. (Best viewed in color.)


4.3.1 Generating Occlusion Configurations 


As mentioned in Sec. 1.2.1, we place 3 cars when generating occlusion configurations. Specifically, we choose the center and 2 other randomly selected positions on a 3 × 3 grid, and put cars around these grid points to simulate occlusions. See some examples in Fig. 6.

The occlusion configurations reflect four factors: car type t, orientation p, relative position r and camera view n. To generate an occlusion configuration, we randomly assign values to these factors: the i-th car has type $t_i$ and orientation $p_i \in \{\text{frontal}, \text{rear}\}$, and it is placed at its nominated position on the 3 × 3 grid plus an offset dr, where dr = (dx, dy) is the relative distance (along the x axis and y axis) between the sampled position and the nominated position of the i-th car. The camera view is in the range of azimuth ∈ [0, 2π] and elevation ∈ [0, π/4]; we discretize the view space into B view bins uniformly along the azimuth angle. In the synthesized configurations, a part is treated as occluded if 60% of its area is not visible.
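
The sampling procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulator: the car type is omitted because the set of CAD models is not specified here, `max_offset` (the range of the offset dr, in grid units) is a hypothetical parameter, and "60% not visible" is interpreted as a hidden fraction of at least 0.6.

```python
import math
import random

def view_bin(azimuth, num_views):
    """Index of the uniform azimuth bin containing `azimuth` (radians)."""
    width = 2.0 * math.pi / num_views
    return int((azimuth % (2.0 * math.pi)) // width)

def sample_configuration(num_views, rng, max_offset=0.3):
    """Sample 3 cars on a 3x3 grid plus a camera view (illustrative only)."""
    cells = [(gx, gy) for gx in range(3) for gy in range(3)]
    center = (1, 1)
    others = rng.sample([c for c in cells if c != center], 2)
    cars = []
    for gx, gy in [center] + others:
        dx = rng.uniform(-max_offset, max_offset)
        dy = rng.uniform(-max_offset, max_offset)
        cars.append({
            "orientation": rng.choice(["frontal", "rear"]),
            "position": (gx + dx, gy + dy),  # nominated position + dr
        })
    azimuth = rng.uniform(0.0, 2.0 * math.pi)
    elevation = rng.uniform(0.0, math.pi / 4.0)
    return {"cars": cars, "azimuth": azimuth, "elevation": elevation,
            "view_bin": view_bin(azimuth, num_views)}

def is_occluded(visible_area, total_area, threshold=0.6):
    """A part counts as occluded once its hidden fraction reaches the threshold."""
    return (total_area - visible_area) / total_area >= threshold
```

In a full pipeline, the visible area of each part would come from rendering the CAD models under the sampled view; that rendering step is outside the scope of this sketch.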


4.3.2 Constructing the Initial And-Or model of Single Cars 

With the part-level visibility information, we compute two vectors for each occlusion configuration. The first is a (17 parts × B camera views)-dimensional binary-valued vector v for the visibilities of parts; the second is a real-valued ((1 root + 17 parts) × B camera views × 4)-dimensional vector b for the bounding boxes of the root and parts. In both vectors, entries corresponding to invisible parts are set to 0.

Denoting by M the dimension of the vector v, and stacking v for N occlusion configurations, we get an N × M occlusion matrix V; the first few rows of this matrix for B = 8 are shown on the right side of Fig. 6. Note that we have partitioned the view space into B views, so for each row, the visible parts always concentrate in the segment of the vector representing that view.
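
A small sketch of how such a matrix could be assembled is given below. It assumes a view-major layout (17 consecutive entries per view) and takes each configuration as a hypothetical `(view_bin, part_visible)` pair; both choices are illustrative and not specified by the text.

```python
import numpy as np

NUM_PARTS = 17

def visibility_vector(view_bin, part_visible, num_views):
    """Binary vector of length 17 * B; only the segment of `view_bin` can be non-zero."""
    v = np.zeros(NUM_PARTS * num_views, dtype=np.uint8)
    start = view_bin * NUM_PARTS
    v[start:start + NUM_PARTS] = np.asarray(part_visible, dtype=np.uint8)
    return v

def occlusion_matrix(configs, num_views):
    """Stack one visibility vector per configuration into an N x M matrix."""
    return np.stack([visibility_vector(b, vis, num_views) for b, vis in configs])
```

With B = 8 this gives M = 136 columns, and each row has non-zero entries only inside the 17-entry segment of its own view.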

In learning an initial And-Or model, each row in V corresponds to a small subtree of the root Or-node. In particular, each subtree consists of an And-node as the root and a set of Terminal-nodes as its children. An example of the data matrix and the corresponding initial And-Or model is shown in the middle of Fig. 6.


4.3.3 Refining the And-Or Structure 

The initial And-Or model is large and redundant, since it has many duplicated occlusion configurations (i.e., duplicated rows in V) and a combinatorial number of part compositions. In the following, we pursue a compact And-Or structure. The problem can be formulated as:

$$\min_{\mathcal{G}} \sum_{i=1}^{N} \left\| V_i - v_i(\mathcal{G}) \right\| + \lambda |\mathcal{G}| \qquad (7)$$

where $V_i$ is the i-th row of the data matrix V, $v_i(\mathcal{G})$ returns its closest occlusion configuration generated by the And-Or graph (AOG) $\mathcal{G}$, $|\mathcal{G}|$ is the number of nodes and edges in the structure, and λ is the trade-off parameter balancing the model precision and complexity. In each view, we assume the number of occlusion branches is not greater than K (= 4).

We solve Eqn. (7) using a modified graph compression algorithm. As illustrated on the right side of Fig. 6, the algorithm starts from the initial And-Or model, and iteratively combines branches if the introduced loss is smaller than the decrease in the complexity term $\lambda |\mathcal{G}|$. This process is equivalent to iteratively finding large blocks of 1s in the corresponding data matrix through row and column permutations, an example of which is shown at the bottom of Fig. 6. As there are consistently visible parts for each view, the algorithm quickly converges to the structure shown in Fig. 6.
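
To make the trade-off in Eqn. (7) concrete, the following sketch greedily merges branches of a single view. It is a simplified stand-in, not the graph compression algorithm used in the paper: each branch is summarized by one binary prototype, the loss is the Hamming distance of each row to its closest prototype, and the complexity is approximated by the number of branches plus the number of visible-part edges. The names `objective`, `merge` and `compress` are hypothetical.

```python
import numpy as np

def objective(V, protos, lam):
    """Reconstruction loss plus lam * complexity for a set of branch prototypes."""
    dists = (V[:, None, :] != protos[None, :, :]).sum(axis=2)
    loss = dists.min(axis=1).sum()
    # Complexity proxy: one And-node per branch plus one edge per visible part.
    complexity = len(protos) + int(protos.sum())
    return loss + lam * complexity

def merge(V, protos, a, b):
    """Replace branches a and b by the majority vote of the rows they explain."""
    dists = (V[:, None, :] != protos[None, :, :]).sum(axis=2)
    owner = dists.argmin(axis=1)
    members = V[(owner == a) | (owner == b)]
    if len(members) == 0:
        merged = protos[a]
    else:
        merged = (members.mean(axis=0) >= 0.5).astype(V.dtype)
    rest = np.delete(protos, [a, b], axis=0)
    return np.vstack([rest, merged[None, :]])

def compress(V, lam=1.0, max_branches=4):
    """Greedily merge branches while the objective drops or too many remain."""
    protos = np.unique(V, axis=0)  # duplicated rows collapse immediately
    while len(protos) > 1:
        current = objective(V, protos, lam)
        candidates = [merge(V, protos, a, b)
                      for a in range(len(protos))
                      for b in range(a + 1, len(protos))]
        best = min(candidates, key=lambda p: objective(V, p, lam))
        if objective(V, best, lam) >= current and len(protos) <= max_branches:
            break
        protos = best
    return protos
```

A merge is accepted when the added reconstruction loss is outweighed by the saved complexity, which mirrors the acceptance rule described above; `max_branches` plays the role of K.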

With the refined And-Or model, we compute occlusion configurations (i.e., the consistently visible parts and optional occluded parts) in each view. In addition, the bounding box size and nominal position of each Terminal-node w.r.t. its parent And-node can also be estimated by the geometric means of the corresponding values in the vector b. This information is used to initialize the latent variables of our model when learning the parameters.

Variants of And-Or Models. We test our model using two types of specifications, to be consistent with our two previous conference papers: one, called And-Or Structure, models occlusion based on CAD simulation without multi-car context components; the other, called Hierarchical And-Or Model, models both occlusion and context. We also compare two methods of part selection in the hierarchical And-Or model: one is based on the greedy parts as done in the DPM [17], denoted by AOG+Greedy, and the other is based on the proposed CAD simulation, denoted by AOG+CAD.

Fig. 6. Illustration of learning occlusion configurations. It consists of three components: (i) Generating occlusion configurations using CAD simulations with 17 semantic parts in total; (ii) Learning the initial And-Or structure based on the data matrix constructed from the simulated occlusion configurations. Each row of the data matrix represents an example and the columns represent the visibility of the 17 semantic parts (a white/gray entry denotes a part is visible/invisible). Each example is represented by an And-node as one child of the root Or-node; (iii) Refining the initial And-Or structure using a graph compression algorithm to seek the consistently visible parts (e.g., X) and optional part clusters (e.g., Y and Z).

5 Learning Parameters 

With the learned And-Or structure, we adopt the WLSSVM method to learn the parameters $\Theta = (\Theta^{app}, \Theta^{def}, \Theta^{bias})$ (for appearance, deformation and bias). When the occlusion configurations are mined by CAD simulations (i.e., for the two model specifications, And-Or Structure and AOG+CAD), we use both Step 0 and Step 1 below in learning the parameters; otherwise we use Step 1 only (i.e., for AOG+Greedy).

Step 0: Initializing Parameters with Synthetic Training Data. We learn the initial parameters Θ with synthetic training data. We randomly superimpose the synthetic positive samples on randomly selected real images in which no cars appear (instead of using a white background directly), to reduce the appearance gap between the synthetic samples and real car samples. In the synthetic data, the parse tree pt for each multi-car positive sample is known, except that the positions of parts are allowed to deform.

Step 1: Learning Parameters with Real Training Data. In the real training data, we only have annotated bounding boxes for single cars. The parse tree pt for each multi-car positive sample is hidden, except for the multi-car configuration, which can be computed based on the annotated bounding boxes of single cars as stated in Sec. 4.2. Then, we initialize the parse tree for each positive sample either based on the initial parameters learned in Step 0 (for the And-Or Structure and AOG+CAD) or using a similar idea as done in learning the mixture of DPMs [17] to initialize the single-car And-nodes (for AOG+Greedy). After the initialization, the parameters Θ are learned iteratively under the WLSSVM framework. During learning, we run the DP inference to assign the optimal parse trees to multi-car positive samples.


The objective function to be minimized is defined by

$$\mathcal{E}(\Theta) = \frac{1}{2} \|\Theta\|^2 + C \sum_{i=1}^{M} L'(\Theta, x_i, y_i) \qquad (8)$$

where $x_i$ represents an N-car training sample (N ≥ 1), $y_i$ is the corresponding N bounding box(es), and $L'(\Theta, x, y)$ is the surrogate loss function.


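$$L'(\Theta, x, y) = \max_{pt \in \Omega_{\mathcal{G}}} \left[ score(x, pt; \Theta) + L_{margin}(y, box(pt)) \right] - \max_{pt \in \Omega_{\mathcal{G}}} \left[ score(x, pt; \Theta) - L_{output}(y, box(pt)) \right] \qquad (9)$$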

where $\Omega_{\mathcal{G}}$ is the space of all parse trees derived from the And-Or model $\mathcal{G}$, $score(x, pt; \Theta)$ computes the score of a parse tree, and $box(pt)$ is the predicted bounding box(es) based on the parse tree. The loss $L_{margin}(y, box(pt))$ encourages high-loss outputs to "pop out" of the first term on the right-hand side, so that their scores get pushed down. The loss $L_{output}(y, box(pt))$ suppresses high-loss outputs in the second term on the right-hand side, so the score of a low-loss prediction gets pulled up. In general, since $L'$ in Eqn. (9) is not convex, the objective function, Eqn. (8), leads to a nonconvex optimization problem. The WLSSVM adopts the CCCP procedure in optimization, which can find a local optimum of the objective. The loss function is defined by


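$$\ell_{l,\tau}(y, box(pt)) = \begin{cases} l & \text{if } y = \perp \text{ and } pt \neq \perp \\ 0 & \text{if } y = \perp \text{ and } pt = \perp \\ l & \text{if } y \neq \perp \text{ and } \exists B \in y \text{ with } ov(B, B') < \tau,\ \forall B' \in box(pt) \\ 0 & \text{if } y \neq \perp \text{ and } ov(B, B') \geq \tau,\ \forall B \in y \text{ and } \exists B' \in box(pt) \end{cases} \qquad (10)$$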


where ⊥ represents background output and ov(·, ·) is the intersection-over-union ratio of two bounding boxes. Following the PASCAL VOC protocol, we have $L_{margin} = \ell_{1,0.5}$ and $L_{output} = \ell_{\infty,0.7}$. In practice, we modify the implementation in [18] for our loss formulation.

Fig. 7. Top: The distribution of overlap ratio and cars per image on the Street-Parking dataset. Bottom: Comparison of the average number of cars per image.
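
The overlap-based loss of Eqn. (10), as used in Sec. 5, can be sketched as below. This is an illustrative reading of the four cases rather than the modified implementation mentioned in the text: boxes are assumed to be (x1, y1, x2, y2) tuples, `None` stands for the background symbol ⊥, and the helper names are hypothetical.

```python
import math

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    ih = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def overlap_loss(level, tau, truth, predicted):
    """Loss l_{level,tau}; `None` stands for the background label/output."""
    if truth is None:
        return level if predicted is not None else 0.0
    if predicted is None:
        return level  # no predicted box can cover the annotated boxes
    covered = all(any(iou(b, p) >= tau for p in predicted) for b in truth)
    return 0.0 if covered else level

def margin_loss(truth, predicted):
    """L_margin = l_{1, 0.5}."""
    return overlap_loss(1.0, 0.5, truth, predicted)

def output_loss(truth, predicted):
    """L_output = l_{inf, 0.7}."""
    return overlap_loss(math.inf, 0.7, truth, predicted)
```

The infinite level in `output_loss` means that, in the second term of Eqn. (9), only parse trees whose boxes overlap every annotated box by at least 0.7 can be selected.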


6 Experiments 

In this section, we evaluate our models on four car detection datasets and three car viewpoint estimation datasets, and present detailed analyses of different aspects of our models. We first introduce two self-collected car datasets of street-parking cars and parking-lot cars respectively (Sec. 6.1), and then evaluate the detection performance of our models on four datasets (Sec. 6.2): the two self-collected datasets, the KITTI car dataset [1] and the PASCAL VOC2007 car dataset. We further analyze the performance of our model w.r.t. its individual components (Sec. 6.3). The performance of car viewpoint estimation is presented in Sec. 6.4.

Training and Testing Time. In all experiments, we use parallel computing to train our model. It takes about 9 hours to train an And-Or Structure model and 16 hours to train a hierarchical And-Or model, due to inferring the assignments of part latent variables on positive training examples and mining hard negatives. For detection, it takes about 2 and 3 seconds to process an image of 640 × 480 pixels for an And-Or Structure and a hierarchical And-Or model, respectively.

6.1 Datasets 

To test our model on occlusion and context modeling, we collected two car datasets (see footnote 4).

The Street Parking Car Dataset. There are several datasets featuring a large number of car images, but they are not suitable for evaluating occlusion handling, as the proportion of (moderately or heavily) occluded cars is marginal. The recently proposed KITTI dataset [1] contains occluded cars parked along the streets, but it cannot fully evaluate the ability of our model, since the car views are rather fixed: the video sequences are captured from a car driving on the road (e.g., there is no bird's-eye view). In addition, the average number of cars per image is still not large enough (mostly 3 cars, see the statistics at the bottom of Fig. 7). To provide a more challenging occlusion dataset, we collected one emphasizing street-parking cars with heavy occlusions, diverse viewpoint changes and a much larger number of cars per image (see the last two rows in Fig. 9). The dataset consists of 881 images. Fig. 7 shows the bounding box overlap distribution and the average number of cars per image. For simplicity of annotation, we only label the bounding boxes of single cars in each image. We split the dataset into training and testing sets containing 440 and 441 images, respectively.

4. http://www.stat.ucla.edu/~boli/publication/street-parking-release.zip and parking_lot_release.zip

Fig. 8. Precision-recall curves on the test subset split from the KITTI trainset (Left) and the Parking Lot dataset (Right).

The Parking Lot Dataset. Our Street Parking Car Dataset provides more viewpoints; however, the context and occlusion configurations are relatively restricted (most cars just compose head-to-head occlusions). To thoroughly evaluate our models in terms of both context and occlusions, we collected the Parking Lot car dataset, which has larger occlusion variations and a larger number of cars in each image (see the 4-th and 5-th rows in Fig. 9). It contains 65 training images and 63 testing images. Although the number of images is small, the number of cars is noticeably large, with 3,346 cars (including left-right mirrored ones) for training and 2,015 cars for testing.

6.2 Detection 

We test our hierarchical And-Or Model on four challenging 
datasets. 

6.2.1 Results on the KITTI Dataset 

The KITTI dataset [1] contains 7,481 training images and 7,518 testing images, which are captured from an autonomous driving platform. We follow the provided benchmark protocol for evaluation. Since the authors of [1] have not released the test annotations, we test our model in the following two settings.

Training and Testing by Splitting the Trainset. We randomly split the KITTI trainset equally into training and testing subsets.

Baseline Methods. Since DPM is a very competitive model with source code publicly available, we compare our model with the latest version of DPM (i.e., voc-release5 [18]). The number of components is set to 16, as in the baseline methods trained in [1]; other parameters are set to their defaults.

Parameter Settings. We consider multi-car contextual patterns with the number of cars N = 1, 2. We set the number of context patterns and occlusion configurations to be 10 and 16, respectively. As a result, the learned hierarchical And-Or model has 10 2-car configurations in Layer 1, and 16 single-car branches in Layer 3.

Detection Results. The left plot in Fig. 8 shows the precision-recall curves of DPM and our model. Our model outperforms DPM by 9.1% in terms of average precision (AP). The performance gain comes from both precision and recall, which shows the importance of context and occlusion modeling.

Methods                     Easy      Moderate   Hard
mBow [19]                   36.02%    23.76%     18.44%
LSVM-MDPM-us [17]           66.53%    55.42%     41.04%
LSVM-MDPM-sv                68.02%    56.48%     44.18%
MDPM-un-BB [17]             71.19%    62.16%     48.43%
OC-DPM                      74.94%    65.95%     53.86%
DPM [18] (trained by us)    77.24%    56.02%     43.14%
MV-RGBD-RF [60]             76.40%    69.92%     57.47%
SubCat [44]                 84.14%    75.46%     59.71%
3DVP [45]                   87.46%    75.77%     65.38%
Regionlets [61]             84.75%    76.45%     59.70%
AOG+Greedy-Half             84.36%    71.88%     59.27%
AOG+Greedy-Full             84.80%    75.94%     60.70%

TABLE 1
Performance comparison (in AP) on the KITTI benchmark [1].

        DPM [18]    And-Or Structure [16]    AOG+Greedy    AOG+CAD
AP      52.0%       57.8%                    62.1%         65.3%

TABLE 2
Performance comparison (in AP) on the Street Parking dataset.


Testing on the KITTI Benchmark. We evaluate our model on the KITTI testset with two different training data settings: one trained using half of the training set, denoted by AOG+Greedy-Half, and the other trained with the full training set, denoted by AOG+Greedy-Full (which has 16 context patterns and 32 occlusion configurations).

The benchmark has three subsets (Easy, Moderate, Hard) w.r.t. the difficulty of object size, occlusion and truncation. All methods are ranked based on performance on the moderately difficult subset. Our entry in the benchmark is "AOG". Table 1 shows the detection results of our model and other state-of-the-art models. Here, we omit the CNN-based methods, as they are all anonymous submissions. Details of the benchmark results are available at http://www.cvlibs.net/datasets/kitti/eval_object.php.

Our AOG+Greedy-Full outperforms all the DPM-based models. Compared with the best of them, OC-DPM, our model improves performance on the three subsets by 9.86%, 9.99% and 6.84% respectively. We also compare with the baseline DPM trained by ourselves using the voc-release5 code [18], and obtain 7.56%, 19.92% and 17.56% performance gains on the three subsets. Among the other DPM-based methods trained by the benchmark authors, our model outperforms the best one, MDPM-un-BB, by 13.61%, 13.78% and 12.27% respectively.

Our model is comparable with SubCat [44], 3DVP [45] and Regionlets [61]. We achieve slightly better performance than Regionlets [61] on the Easy and Hard sets, but lose a little AP on the Moderate set. Though our method obtains a better rank than 3DVP [45] on the moderately difficult set, it performs slightly worse on the Easy and Hard subsets, which shows the promise of 3D occlusion modeling and subcategory clustering [44], [45].

Comparing AOG+Greedy-Half and AOG+Greedy-Full, we can observe that the major improvement (4.06%) of AOG+Greedy-Full comes from the Moderate set, while on the Easy and Hard sets we obtain small improvements (0.44% and 1.43%, respectively). These results indicate that there is still large potential for improvement in object representation, and that much effort should be devoted to improving our current hierarchical And-Or model.

The first 3 rows in Fig. 9 show the qualitative results of our model. The red bounding boxes show successful detections, the blue ones missed detections, and the green ones false alarms. In experiments, our model is robust in detecting cars with heavy car-to-car occlusions and background clutter. The failure cases are mainly due to extreme occlusions, extremely low resolution, large car deformation and/or inaccurate (or multiple) bounding box localization.

6.2.2 Results on the Parking Lot Dataset 

Evaluation Protocol. We follow the PASCAL VOC evaluation protocol, with the intersection-over-union overlap required to be greater than or equal to 60% (instead of the original 50%). In practice, we set this threshold as a compromise between localization accuracy and detection difficulty. Detected cars with bounding box height smaller than 25 pixels are not counted as false positives. We compare with the latest version of the DPM implementation [18] and set the number of contextual patterns and occlusion configurations to be 10 and 18 respectively.

Detection Results. The right side of Fig. 8 shows the performance comparison between our model and DPM. Our model obtains 55.2% in AP, which outperforms the latest version of DPM by 10.9%. The fourth and fifth rows in Fig. 9 show the qualitative results. Our model is capable of detecting cars with different occlusions and viewpoints.

6.2.3 Results on the Street Parking Dataset 

To compare with the benchmark methods, we follow the evaluation protocol provided with the dataset.

Results of our model and other benchmark methods are shown in Table 2. Our hierarchical And-Or model outperforms DPM [18] and our previous And-Or Structure [16] by 10.1% and 4.3% respectively. We attribute the improvement to the joint representation of context patterns and occlusion configurations. The last two rows in Fig. 9 show some qualitative examples. Our model is capable of detecting occluded street-parking cars, although it also produces a few inaccurate detections and misses some cars (mainly due to low resolution).

6.3 Diagnosing the Performance of our Model

In this section, we evaluate various aspects to diagnose the 
effects of each individual component in our model. 

6.3.1 The Effect of Occlusion Modeling 
Our And-Or Structure model is based on CAD simulation. Thus, in the first analysis, we test the effectiveness of the learned And-Or structure in representing different occlusion configurations. To this end, we generate a synthetic dataset using 5,040 3-car synthetic images as our training data, and a mixture of 3,000 3-car and 7-car (placed in a 1 × 7 grid) synthetic images as our testing data. For each generated image, we add a background from the category None of the TU Graz-02 dataset and apply Gaussian blur to reduce the boundary effects. Samples of the training and testing data are shown on the left and in the middle of Fig. 10. In experimental comparisons, the best DPM has 16 components and the best And-Or structure has 8 views with 19 occlusion configurations, 5 layers and 111 nodes in total. As shown on the right side of Fig. 10, our model outperforms the DPM by 7.2% in AP.

Fig. 9. Examples of successful and failure cases by our model on the KITTI dataset (first 3 rows), the Parking Lot dataset (the 4-th and 5-th rows) and the Street Parking dataset (the last two rows). Best viewed in color and magnification.

Fig. 10. Left and Middle: Training and testing samples from the synthetic dataset. Right: detection results of DPM and And-Or Structure.

6.3.2 The Effect of CAD Simulation in Real Situations 
To verify the effectiveness of our And-Or Structure model in terms of occlusion modeling, we compare it with the state-of-the-art DPM [18]. Both models are based on part-level occlusion modeling. The And-Or Structure learns semantic visible parts based on CAD simulations. The DPM handles occlusion implicitly by introducing a truncation feature at each HOG cell. The second and third columns in Table 2 show their performance on the Street Parking dataset. We can see that the semantic visible parts learned from CAD simulations generalize to real datasets. We are also interested in whether adding context affects the effectiveness of occlusion modeling. To compare AOG+Greedy and AOG+CAD fairly, they are given the same number of context patterns and occlusion configurations, 8 and 16 respectively. As shown in the fourth and fifth columns in Table 2, AOG+CAD performs better than AOG+Greedy, which shows the advantage of modeling occlusion using semantic visible parts.

Fig. 11 shows the part bounding boxes inferred by AOG+Greedy and AOG+CAD. We can observe that the semantic parts in AOG+CAD are meaningful, although they may not be accurate enough in some examples.

car     DPM [18]    And-Or Structure [16]    AOG+Greedy
AP      58.2%       58.7%                    60.6%

TABLE 3
Performance comparison (in AP) on the PASCAL VOC 2007.

6.3.3 The Effect of Multi-car Context Modeling 

The state-of-the-art models are mainly based on single-car modeling. To evaluate the effectiveness of context, we compare our hierarchical And-Or model with other non-context models in Table 1. We can see that our model outperforms all other models in different occlusion settings. Specifically, our model outperforms DPM by a large margin (above 10% in AP) on the "Moderate" and "Hard" KITTI test data, which shows that context is very important to object detection, especially in heavily occluded car-to-car situations.

On the Street Parking dataset, we observe the same results. In Table 2, both AOG+Greedy and AOG+CAD outperform DPM and And-Or Structure by a large margin. Here, AOG+Greedy and AOG+CAD jointly model context and occlusions, while DPM and And-Or Structure model occlusions only.

6.3.4 Performance on General Occlusion Settings 

Our model is generalizable in terms of context and occlusion modeling: it can cope with both occluded and non-occluded situations. To verify our model on less occluded settings, we use the PASCAL VOC 2007 Car dataset as a testbed. As analyzed by Hoiem et al., cars in the PASCAL VOC dataset do not have much occlusion or car-to-car context.

We first show that our And-Or Structure is able to detect cars on the PASCAL VOC 2007 as well as the DPM method [18]. To approximate the occlusion configurations observed on this dataset, we generate synthetic images with car-to-car occlusions and car self-occlusions. For the car-to-car occlusions, we use the full 3 × 3 grid instead of the special case used for the Street Parking dataset. Correspondingly, the learned And-Or structure contains branches for self-occlusions as well as those for car-to-car occlusions. On this dataset, the DPM has 6 components and the And-Or structure has 6 views with 10 occlusion configurations, 5 layers and 109 nodes.

Fig. 11. Visualization of part layouts output by our AOG+Greedy (Top) and AOG+CAD (Bottom). Best viewed in color and magnification.

Pascal VOC 2006 Car Dataset
        MPPE    0.69   ...    0.73

3D Car Dataset
        AP      99.6   96     76.7   99.2   9?.9   99.7   99.9
        MPPE    86.3   89     70     85.3   97.9   96.3   94

TABLE 4
View Estimation on the Pascal VOC 2006 Car Dataset and the 3D Car Dataset. The citations marked 1 and 2 refer to DPM-VOC+VP and DPM-3D-Constraints, respectively.

Table 3 shows the performance of our And-Or Structure model and the DPM. Our model achieves slightly better performance than DPM. This experiment shows that our And-Or Structure method does not lose performance on general datasets.

Then, we verify that our hierarchical And-Or model is able to detect cars on the PASCAL VOC 2007 as well as other single-object models. We compare with the latest version of DPM [18]. The APs are 60.6% (our model) and 58.2% (DPM) respectively (Table 3).

6.4 View Estimation 

With the help of CAD simulations, our And-Or Structure 
model can compute the viewpoints of detected cars. To 
verify the capability of view estimation, we perform 2 
experiments. 

Firstly, we report the mean precision in pose estimation (MPPE), equivalent to the mean of the confusion matrix diagonal, on both the Pascal VOC 2006 car dataset and the 3D Object Classes dataset. The 3D Object Classes dataset was introduced in 2007. For each class, it has images of 10 different object instances with 8 different poses. We follow the established evaluation protocol for this dataset: 7 randomly selected car instances are used for training, and 3 instances for testing. The 2D car bounding boxes are computed from the annotated segmentation masks. The negative examples are collected from the PASCAL VOC 2007 car dataset. For the VOC 2006 car database, there are 469 cars with viewpoint labels (frontal, rear, left and right). We only use these labeled images with the standard training/test split. The detection performance is evaluated through the precision-recall (PR) curve. For view estimation, the two datasets emphasize visible cars. Our And-Or structure has 8 views with 8 (self-occlusion) branches, 5 layers and 90 nodes. Table 4 shows the comparison of our model with the state-of-the-art methods on these two datasets. Our model is comparable to or better than some recently proposed models (e.g., [64]).
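
For reference, the MPPE metric can be computed as in the short sketch below. It assumes that views are given as integer indices and that the confusion matrix is normalized per ground-truth view before averaging its diagonal; the function name `mppe` is illustrative.

```python
import numpy as np

def mppe(true_views, predicted_views, num_views):
    """Mean of the diagonal of the row-normalised viewpoint confusion matrix."""
    confusion = np.zeros((num_views, num_views), dtype=float)
    for t, p in zip(true_views, predicted_views):
        confusion[t, p] += 1.0
    row_sums = confusion.sum(axis=1)
    present = row_sums > 0  # skip views with no ground-truth example
    per_view = np.diag(confusion)[present] / row_sums[present]
    return float(per_view.mean())
```

For instance, with 4 views and two test cars per view, classifying one car wrongly in two of the views gives per-view precisions of 1, 0.5, 1 and 0.5, i.e., an MPPE of 0.75.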

Secondly, we compare our model with the state-of-the-art models on the recently proposed PASCAL3D+ dataset. This dataset augments 12 rigid categories in the PASCAL VOC 2012 [2] with 3D annotations by fitting CAD models to 2D images semi-manually. It is a challenging dataset for 3D object detection and pose estimation. We test on the car category. We use the Average Viewpoint Precision (AVP) metric to simultaneously evaluate 2D bounding box localization and viewpoint estimation. In computing the AVP, a candidate detection is considered to be a true positive if and only if the bounding box overlap is larger than 50% and the viewpoint is correct.
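
The joint criterion can be illustrated with the sketch below, which accumulates a non-interpolated average precision over ranked detections. It is a simplification under stated assumptions: each detection is a hypothetical `(score, overlap, view_correct)` triple that has already been matched to a distinct ground-truth car, and the official benchmark code may use a different interpolation of the precision-recall curve.

```python
def average_viewpoint_precision(detections, num_ground_truth):
    """detections: list of (score, overlap, view_correct) triples."""
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    avp = 0.0
    for rank, (_, overlap, view_correct) in enumerate(ranked, start=1):
        # True positive only if the box overlaps enough AND the view is right.
        if overlap > 0.5 and view_correct:
            true_pos += 1
            # Each new true positive raises recall by 1 / num_ground_truth.
            avp += (true_pos / rank) / num_ground_truth
    return avp
```

Dropping the `view_correct` condition turns the same routine into a plain detection AP, which is why the AVP can never exceed the AP of the same detector.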

Table 5 shows the results of our model and the state-of-the-art methods. Our method is better than VDPM and a deep-CNN-feature-based model (decaf). Our And-Or Structure is comparable with DPM-VOC+VP, which also uses CAD models to learn viewpoints and part-level car geometry.

            VDPM           DPM-VOC+VP     (fisher-Lspm)   (decaf)        our And-Or Structure
4 views     37.2%/2?.?%    45.6%/36.9%    36.1%/28.9%     36.1%/2?.?%    43.0%/34.3%
8 views     37.3%/23.5%    47.6%/36.6%    36.1%/26.6%     36.1%/23.3%    44.9%/33.2%
16 views    36.6%/18.1%    46.0%/29.6%    36.1%/19.6%     36.1%/19.4%    43.2%/27.6%
24 views    36.3%/13.7%    42.1%/24.6%    36.1%/15.9%     36.1%/16.7%    41.1%/22.9%

TABLE 5
The results of VDPM, DPM-VOC+VP and And-Or Structure on the PASCAL3D+ Car Dataset. The first number indicates the average precision (AP) for detection and the second number shows the average viewpoint precision (AVP) for joint object detection and view estimation.

7 Conclusion

In this paper, we present an And-Or model to represent context and occlusion for car detection and viewpoint estimation. The model structure is learned by mining multi-car contextual patterns and occlusion configurations at three levels: a) multi-car layouts, b) single cars and c) parts. Our model is organized in a directed acyclic graph structure, so the efficient DP algorithm can be used in inference. The model parameters are learned by WLSSVM. Experimental results show that our model is effective in modeling context and occlusion information in complex situations, and achieves better performance than state-of-the-art car detection methods and comparable performance on viewpoint estimation.

There are two main limitations in our current implementation. The first is that we exploit multi-car contextual patterns using 2-car composites only. In scenarios similar to street-parking cars and parking-lot cars, we could explore multi-car context with more than 2 spatially-aligned cars, as well as 3D scene parsing context [70]. The second is that we use only HOG features for appearance. Based on the recent progress on feature learning by convolutional neural networks (CNNs), we can also substitute CNN features for HOG. Both aspects are being addressed in our ongoing work and may potentially improve the performance.

Meanwhile, we are applying the proposed method to other object categories and studying different ways of mining contextual patterns and occlusion configurations (e.g., integrating with the And-Or quantization methods for 2D object modeling and 3D car modeling [47]).